Abstract

In recent years, several probabilistic techniques have been
applied to various debugging problems. However, most existing
probabilistic debugging systems use relatively simple
statistical models, and fail to generalize across multiple
programs. In this work, we propose Tractable Fault Localization
Models (TFLMs) that can be learned from data, and
probabilistically infer the location of the bug. While most
previous statistical debugging methods generalize over many
executions of a single program, TFLMs are trained on a corpus
of previously seen buggy programs, and learn to identify
recurring patterns of bugs. Widely-used fault localization
techniques such as Tarantula evaluate the suspiciousness of
each line in isolation; in contrast, a TFLM defines a joint
probability distribution over buggy indicator variables for
each line. Joint distributions with rich dependency structure
are often computationally intractable; TFLMs avoid this by
exploiting recent developments in tractable probabilistic
models (specifically, Relational SPNs). Further, TFLMs can
incorporate additional sources of information, including
coverage-based features such as Tarantula. We evaluate the
fault localization performance of TFLMs that include Tarantula
scores as features in the probabilistic model. Our study shows
that the learned TFLMs isolate bugs more effectively than
previous statistical methods or using Tarantula directly.


Introduction 


According to a 2002 NIST study (RTI International 2002),
software bugs cost the US economy an estimated $59.5 billion
per year. While some of these costs are unavoidable, the
report claimed that an estimated $22.2 billion could be saved
with more effective tools for the identification and removal
of software errors. Several other sources estimate that over
50% of software development costs are spent on debugging
and testing (Hailpern and Santhanam 2002).


The need for better debugging tools has long been recognized.
The goal of automating various debugging tasks has motivated
a large body of research in the software engineering
community. However, this line of work has only recently
begun to take advantage of advances in probabilistic
models, and their inference and learning algorithms.

In this work, we apply state-of-the-art probabilistic methods
to the problem of fault localization. We propose
Tractable Fault Localization Models (TFLMs) that can be
learned from a corpus of known buggy programs (with the
bug locations annotated). The trained model can then be
used to infer the probable locations of buggy lines in a
previously unseen program. Conceptually, a TFLM is a
probability distribution over programs in a given language,
modeled jointly with any attributes of interest (such as bug
location indicator variables, or diagnostic features).
Conditioned on a specific program, a TFLM defines a joint
probability distribution over the attributes.

The key advantage of probabilistic models is their ability
to learn from experience. Many software faults are instances
of a few common error patterns, such as off-by-one errors
and use of uninitialized values (Brun and Ernst 2004).
Human debuggers improve with experience as they encounter
more of these common fault patterns, and learn to recognize
them in new programs. Automated debugging systems
should be able to do the same.

Another advantage of probabilistic models is that they allow
multiple sources of information to be combined in a
principled manner. The relative contribution of each feature
is determined by its predictive value in the training corpus,
rather than by a human expert. A TFLM can incorporate as
features the outputs of other fault localization systems, such
as the Tarantula hue (Jones, Harrold, and Stasko 2002)
of each line.

In recent years, there has been renewed interest in learning
rich, tractable models, on which exact probabilistic
inference can be performed in polynomial time (e.g.
Sum-Product Networks; Poon and Domingos 2011). TFLMs build
on Relational Sum-Product Networks (Nath and Domingos
2015) to enable exact inference in space and time linear in
the size of the program. We empirically compare TFLMs to
the widely-used Tarantula fault localization method, as
well as the Statistical Bug Isolation (SBI) system, on four
mid-sized C programs. TFLMs outperform the other systems
on three of the four test subjects.

Background 

Coverage-based Fault Localization 

Coverage-based debugging methods isolate the bug’s location
by analyzing the program’s coverage spectrum on a set
of test inputs. These approaches take the following as input:

1. a set of unit tests;

2. a record of whether or not the program passed each test;

3. program traces, indicating which components (usually
lines) of the program were executed when running each
unit test.


Using this information, these methods produce a
suspiciousness score for each component in the program. The most
well-known method in this class is the Tarantula system
(Jones, Harrold, and Stasko 2002), which uses the following
scoring function:


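\[
S_{\mathrm{Tarantula}}(s) =
\frac{\dfrac{\mathit{Failed}(s)}{\mathit{TotalFailed}}}
     {\dfrac{\mathit{Passed}(s)}{\mathit{TotalPassed}} + \dfrac{\mathit{Failed}(s)}{\mathit{TotalFailed}}}
\]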


Here, Passed(s) and Failed(s) are respectively the number
of passing and failing test cases that include statement s,
and TotalPassed and TotalFailed are the number of passing
and failing test cases respectively. In an empirical
evaluation (Jones and Harrold 2005), Tarantula was shown
to outperform previous methods such as cause transitions
(Cleve and Zeller 2005), set union, set intersection and
nearest neighbor (Renieris and Reiss 2003), making it the state of
the art in fault localization at the time. Since the publication
of that experiment, a few other scoring functions have been
shown to outperform Tarantula under certain conditions
(Abreu et al. 2009). Nonetheless, Tarantula remains the
most well-known method in this class.
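
The scoring function can be sketched in a few lines of Python. This is an illustrative sketch, not the original Tarantula implementation: the function names, the representation of coverage as one set of statement ids per test, and the convention of returning 0.0 when a ratio is undefined are all assumptions made here.

```python
def tarantula_score(passed_s, failed_s, total_passed, total_failed):
    """Suspiciousness of a statement s under the Tarantula formula."""
    # Fractions of failing / passing tests that execute the statement.
    # Treating an empty test group as ratio 0.0 is an assumption of this sketch.
    fail_ratio = failed_s / total_failed if total_failed else 0.0
    pass_ratio = passed_s / total_passed if total_passed else 0.0
    denom = pass_ratio + fail_ratio
    # A statement covered by no test gets score 0.0 (assumed convention).
    return fail_ratio / denom if denom else 0.0


def tarantula_scores(coverage, outcomes):
    """coverage[t] is the set of statements executed by test t;
    outcomes[t] is True if test t passed."""
    total_passed = sum(outcomes)
    total_failed = len(outcomes) - total_passed
    scores = {}
    for s in set().union(*coverage):
        passed_s = sum(1 for cov, ok in zip(coverage, outcomes) if ok and s in cov)
        failed_s = sum(1 for cov, ok in zip(coverage, outcomes) if not ok and s in cov)
        scores[s] = tarantula_score(passed_s, failed_s, total_passed, total_failed)
    return scores


# Hypothetical three-test suite over statements 1..3; the last test fails.
coverage = [{1, 2, 3}, {1, 2}, {1, 3}]
outcomes = [True, True, False]
scores = tarantula_scores(coverage, outcomes)
```

In this toy suite, statement 2 is never executed by the failing test and so receives the lowest score, while statement 3, executed by the failing test and only one passing test, receives the highest.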


Probabilistic Debugging Methods 
Per-Program Learning. Several approaches to fault localization
make use of statistical and probabilistic methods.
Liblit et al. proposed several influential statistical debugging
methods. Their initial approach (Liblit et al. 2003) used
ℓ1-regularized logistic regression to predict non-deterministic
program failures. (The instances are runs of a program, the
features are instrumented program predicates, and the models
are trained to predict a binary ‘failure’ variable. The
learned weights of the features indicate which predicates are
the most predictive of failure.) In later work (Liblit et al.
2005), they use a likelihood ratio hypothesis test to determine
which predicates (e.g. branches, sign of return value)
in an instrumented program are predictive of program failure.
Zhang et al. (2011) evaluate several other hypothesis
testing methods in a similar setting.

The SOBER system (Liu et al. 2005; Liu et al. 2006) improves
on Liblit et al.’s 2005 approach by taking into account
the fact that a program predicate can be evaluated multiple
times in a single test case. They learn conditional
distributions over the probability of a predicate evaluating to true,
conditioned on the success or failure of the test case. When
these conditional distributions differ (according to a statistical
hypothesis test), the predicate is considered to be ‘relevant’
to the bug. The HOLMES system (Chilimbi et al. 2009) extends
Liblit et al.’s approach along another direction, analyzing
path profiles instead of instrumented predicates.

Wong et al. (2008; 2012) use a crosstab-based statistical
analysis to quantify the dependence between statement
coverage and program failure. Their approach can be seen
as a hybrid between the Liblit-style statistical analysis and
Tarantula-style spectrum-based analysis. Wong et al.
(2009; 2012) also proposed two neural network-based fault
localization techniques trained on program traces. Ascari et
al. (2009) investigate the use of SVMs in a similar setting.

Many of the methods described above operate under the
assumption that the program contains exactly one bug. Some
of these techniques have nevertheless been evaluated on
programs with multiple faults, using an iterative process where
the bugs are isolated one by one. Briand et al. (2007)
explicitly extend Tarantula to the multiple-bug case, by learning
a decision tree to partition failing test cases. Each partition
is assumed to model a different bug. Statements are
ranked by suspiciousness using a Tarantula-like scoring
function, with the scores computed separately for each
partition. Other clustering methods have also been applied to test
cases; for example, Andrzejewski et al. (2007) use a form of
LDA to discover latent ‘bug topics’.

Generalizing Across Programs. The key limitation of the
statistical and machine learning-based approaches discussed
above is that they only generalize over many executions of
a single program. Ideally, a machine learning-based debugging
system should be able to generalize over multiple programs
(or, at least, multiple sequential versions of a program).
As discussed above, many software defects are instances
of frequently occurring fault patterns; in principle,
a machine learning model can be trained to recognize these
patterns and use them to more effectively localize faults in
new programs.

This line of reasoning has received relatively little attention
in the automated debugging literature. The most prominent
approach is the Fault Invariant Classifier (FIC) of Brun
and Ernst (2004). FIC is not a fault localization algorithm
in the sense of Tarantula and the other approaches discussed
above. Instead of localizing the error to a particular
line, FIC outputs fault-revealing properties that can guide
a human debugger to the underlying error. These properties
can be computed using static or dynamic program analysis;
FIC uses the Daikon dynamic invariant detector (Ernst et al.
2001). At training time, properties are computed for pairs
of buggy and fixed programs; properties that occur in the
buggy programs but not the fixed programs are labeled as
‘fault-revealing’. The properties are converted into
program-independent feature vectors, and an SVM or decision tree is
trained to classify properties as fault-revealing or
non-fault-revealing. The trained classifier is then applied to properties
extracted from a previously unseen, potentially faulty
program, to reveal properties that indicate latent errors.

Tractable Probabilistic Models 
Sum-Product Networks 

A sum-product network (SPN) is a rooted directed acyclic 
graph with univariate distributions at the leaves; the internal 
nodes are (weighted) sums and (unweighted) products. 

Definition 1. (Gens and Domingos 2013)

1. A tractable univariate distribution is an SPN.

2. A product of SPNs with disjoint scopes is an SPN. (The
scope of an SPN is the set of variables that appear in it.)

3. A weighted sum of SPNs with the same scope is an SPN,
provided all weights are positive.

4. Nothing else is an SPN.

Figure 1: Example SPN over the variables $x_1$, $x_2$ and $x_3$.
All leaves are Bernoulli distributions, with the given
parameters. The weights of the sum node are indicated next to
the corresponding edges.

Intuitively, an SPN (fig. 1) can be thought of as an
alternating set of mixtures (sums) and decompositions (products)
of the leaf variables. If the values at the leaf nodes are set to
the partition functions of the corresponding univariate
distributions, then the value at the root is the partition function
(i.e. the sum of the unnormalized probabilities of all possible
assignments to the leaf variables). This allows the partition
function to be computed in time linear in the size of the SPN.

If the values of some variables are known, the leaves
corresponding to those variables’ distributions are set to those
values’ probabilities, and the remainder are replaced by their
(univariate) partition functions. This yields the unnormalized
probability of the evidence, which can be divided by
the partition function to obtain the normalized probability.
The most probable state of the SPN, viewing sums as
marginalized-out hidden variables, can also be computed in
linear time. The first learning algorithms for sum-product
networks used a fixed network structure, and only optimized
the weights (Poon and Domingos 2011; Amer and Todorovic
2012; Gens and Domingos 2012). More recently, several
structure learning algorithms for SPNs have also been
proposed (Dennis and Ventura 2012; Gens and Domingos 2013;
Peharz, Geiger, and Pernkopf 2013).
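
The bottom-up evaluation described above can be illustrated with a small Python sketch. The network below is hypothetical (its structure and parameters are invented for illustration and are not those of figure 1), and the class names `Leaf`, `Product` and `Sum` are assumptions of this sketch rather than any library's API.

```python
class Leaf:
    """Bernoulli leaf over one binary variable."""

    def __init__(self, var, p):
        self.var, self.p = var, p

    def value(self, evidence):
        if self.var not in evidence:
            # Unobserved variable: use the leaf's partition function,
            # which is 1 for a normalized Bernoulli distribution.
            return 1.0
        return self.p if evidence[self.var] == 1 else 1.0 - self.p


class Product:
    """Product node; children must have disjoint scopes."""

    def __init__(self, children):
        self.children = children

    def value(self, evidence):
        result = 1.0
        for child in self.children:
            result *= child.value(evidence)
        return result


class Sum:
    """Weighted sum node; children must share the same scope."""

    def __init__(self, weighted_children):
        self.weighted_children = weighted_children  # list of (weight, node)

    def value(self, evidence):
        return sum(w * c.value(evidence) for w, c in self.weighted_children)


# Hypothetical SPN: a two-component mixture of fully factorized distributions.
spn = Sum([
    (0.3, Product([Leaf("x1", 0.9), Leaf("x2", 0.2), Leaf("x3", 0.5)])),
    (0.7, Product([Leaf("x1", 0.1), Leaf("x2", 0.6), Leaf("x3", 0.4)])),
])

Z = spn.value({})                  # partition function (no evidence)
p_x1 = spn.value({"x1": 1}) / Z    # normalized marginal P(x1 = 1)
```

Because this example is a tree, each node is visited once per query. For an SPN with shared sub-networks (a general DAG), node values would need to be cached to keep evaluation linear in the size of the network.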

Relational Sum-Product Networks 

SPNs are a propositional representation, modeling instances
as independent and identically distributed (i.i.d.). Although
the i.i.d. assumption is widely used in statistical machine
learning, it is often an unrealistic assumption. In practice,
objects usually interact with each other; Statistical
Relational Learning algorithms can capture dependencies
between objects, and make predictions about relationships
between them.

Relational Sum-Product Networks (RSPNs; Nath and
Domingos 2015) generalize SPNs by modeling a set of
instances jointly, allowing them to influence each other’s
probability distributions, as well as modeling probabilities of
relations between objects. RSPNs can be seen as templates
for constructing SPNs, much like Markov Logic Networks
(Richardson and Domingos 2006) are templates for Markov
networks. RSPNs also require as input a part decomposition,
which describes the part-of relationships among the
objects in the mega-example. Unlike previous high-treewidth
tractable relational models such as TML (Domingos and
Webb 2012), RSPNs can generalize across mega-examples
of varying size and structure.

Tractable Fault Localization 

Tractable Fault Localization Models 

A Tractable Fault Localization Model (TFLM) defines a
probability distribution over programs in some deterministic
language L. The distribution may also model additional
variables of interest that are not part of the program itself;
we refer to such variables as attributes. In the fault
localization setting, the important attribute is a buggy indicator
variable on each line. Other informative features may also be
included as attributes; for instance, one or more
coverage-based metrics may be included for each line.

More formally, consider a language whose grammar
$L = (V, \Sigma, R, S)$, where:

• $V$ is a set of non-terminal symbols;

• $\Sigma$ is a set of terminal symbols;

• $R$ is a set of production rules of the form $\alpha \to \beta$, where
$\alpha \in V$ and $\beta$ is a string of symbols in $V \cup \Sigma$;

• $S \in V$ is the start symbol.

Definition 2. A Tractable Fault Localization Model for
language $L$ consists of:

• a map from non-terminals in $V$ to sets of attribute
variables (discrete or continuous);

• for each symbol $\alpha \in V$, a set of latent subclasses
$\alpha_1, \ldots, \alpha_k$;

• $\pi_S$, a probability distribution over subclasses of the start
symbol $S$;

• for each subclass $\alpha_i$ of $\alpha$,

- a univariate distribution $\psi_{\alpha_i, x}$ over each attribute $x$
associated with $\alpha$;

- a probability distribution $\rho_{\alpha_i}$ over the rules
$\alpha_i \to \beta$, one for each rule $\alpha \to \beta$ in $L$;

- for each rule $\alpha_i \to \beta$, for each non-terminal $\alpha' \in \beta$, a
distribution $\pi_{(\alpha_i \to \beta), \alpha'}$ over subclasses of $\alpha'$.

The univariate distribution over each attribute may be
replaced with a joint distribution over all attributes, such as a
logistic regression model within each subclass that predicts
the value of the buggy attribute, using one or more other
attributes as features. However, for simplicity, we present the
remainder of this section with the attributes modeled as a
product of univariate distributions, and assume that the
attributes are discrete.

Being defined over the grammar of the programming
language, TFLMs can capture information extremely useful for
the fault localization task. For example, a TFLM can
represent different fault probabilities for different symbols in the
grammar. In addition, the latent subclasses give TFLMs a
degree of context-sensitivity; the same symbol can be more
or less likely to contain a fault depending on its latent
subclass, which is probabilistically dependent on the subclasses
of ancestor and descendant symbols in the parse tree. This
makes TFLMs much richer than models like logistic
regression, where the features are independent conditioned on the
class variable. Despite this representational power, exact
inference in TFLMs is still computationally efficient.

Example 1. The following rules are a fragment of the
grammar of a Python-like language:

while_stmt → 'while' condition ':' suite
condition → expr operator expr
condition → 'not' condition

We refer to the above rules as $r_1$, $r_2$ and $r_3$ respectively.

The following is a partial specification of a TFLM over
this grammar, with the while_stmt symbol as root.

• All non-terminal symbols have buggy and suspiciousness
attributes. buggy is a fault indicator, and suspiciousness is
a diagnostic attribute, such as a Tarantula score.

• Each non-terminal has two latent subclass symbols. For
example, while_stmt has subclasses while_stmt_1 and
while_stmt_2.

• The distribution over start symbols is:
$\pi(\text{while\_stmt}_1) = 0.4$, $\pi(\text{while\_stmt}_2) = 0.6$.

• For subclass symbol while_stmt_1 (subclass subscripts
omitted):

- $\psi_{\text{buggy}} \sim \mathrm{Bernoulli}(0.01)$

- $\psi_{\text{suspiciousness}} \sim \mathcal{N}(0.4, 0.05)$

- $\rho(r_1) = 1.0$, since while_stmt has a single rule.

- The distributions over child symbol subclasses for $r_1$ are:
$\pi_{r_1, \text{condition}}(\text{condition}_1) = 0.7$,
$\pi_{r_1, \text{condition}}(\text{condition}_2) = 0.3$,
$\pi_{r_1, \text{suite}}(\text{suite}_1) = 0.2$,
$\pi_{r_1, \text{suite}}(\text{suite}_2) = 0.8$.

(The complete TFLM specification would have similar
definitions for all the other subclass symbols in the model.)
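
To make the shape of such a specification concrete, the fragment from Example 1 can be written down as plain Python data. This is only an illustrative encoding: the `SubclassSpec` container, its field names, and the `is_normalized` helper are assumptions of this sketch, not part of the model definition.

```python
from dataclasses import dataclass


@dataclass
class SubclassSpec:
    """Parameters attached to one latent subclass of a non-terminal."""
    attributes: dict      # attribute name -> (distribution family, parameters)
    rule_probs: dict      # rule name -> probability (the rho distribution)
    child_subclass: dict  # (rule, child symbol) -> {child subclass: prob} (pi)


# Distribution over subclasses of the start symbol (pi_S in Definition 2).
start_dist = {"while_stmt_1": 0.4, "while_stmt_2": 0.6}

# Subclass while_stmt_1, with the numbers given in Example 1.
# The example does not say whether 0.05 is a variance or a standard
# deviation, so the Gaussian parameters are stored exactly as written.
while_stmt_1 = SubclassSpec(
    attributes={
        "buggy": ("bernoulli", (0.01,)),
        "suspiciousness": ("gaussian", (0.4, 0.05)),
    },
    rule_probs={"r1": 1.0},
    child_subclass={
        ("r1", "condition"): {"condition_1": 0.7, "condition_2": 0.3},
        ("r1", "suite"): {"suite_1": 0.2, "suite_2": 0.8},
    },
)


def is_normalized(dist, tol=1e-9):
    """Check that a discrete distribution sums to one."""
    return abs(sum(dist.values()) - 1.0) < tol
```

A complete model would hold one such record per subclass of every non-terminal; checking each discrete distribution with `is_normalized` is a cheap sanity test when parameters are entered or learned.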

Semantics. Conceptually, a TFLM defines a probability
distribution over all programs in L, and their attributes. More
formally, the joint distribution P(T, A, C) is defined over:

• a parse tree T;

• an attribute assignment A, specifying values of all
attributes of all non-terminal symbols in T;

• latent subclass assignment C for each non-terminal in T.

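For parse tree $T$ containing rules $r_1 = \alpha_1 \to \beta_1$,
$r_2 = \alpha_2 \to \beta_2$, ..., $r_n = \alpha_n \to \beta_n$, and root symbol $\alpha_R$,

\[
P(T, A, C) = \prod_{i=1}^{n} \Bigg( \rho_{C(\alpha_i)}(\alpha_i \to \beta_i)
  \times \prod_{\alpha' \in \beta_i} \pi_{(C(\alpha_i) \to \beta_i),\, \alpha'}\big(C(\alpha')\big)
  \times \prod_{x \in \mathrm{attr}(\alpha_i)} \psi_{C(\alpha_i),\, x}\big(A(x)\big) \Bigg)
  \times \pi_S\big(C(\alpha_R)\big)
\]

Here $C(\alpha)$ is the latent subclass assigned to symbol $\alpha$,
$\mathrm{attr}(\alpha_i)$ is the set of attributes of $\alpha_i$, and $A(x)$ is the
value assigned to attribute $x$.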

Inference. As with RSPNs, inference in TFLMs is performed
by grounding out the model into an SPN. The SPN is
constructed in a recursive top-down manner, beginning with the
start node:

• Emit a sum node over subclasses of the start symbol,
weighted according to $\pi_S$. Let the current symbol $\alpha$ be
the start symbol, and let $\alpha_i$ be its subclass.

• Emit a product node with one child for each attribute of
$\alpha$, and a child for the subprogram rooted at $\alpha$.

• For each attribute $x$ of $\alpha$, emit a sum node over the
attribute values for the current symbol, weighted by $\psi_{\alpha_i, x}$.

• Emit a sum node over production rules $\alpha_i \to \beta$ for the
current symbol, weighted according to $\rho_{\alpha_i}$. (Note that when
grounding a TFLM over a known parse tree, all but one child of
this sum node is zeroed out, and need not be grounded.)

• Recurse over each non-terminal $\alpha'$ in $\beta$, choosing its
subclass via a sum node weighted by $\pi_{(\alpha_i \to \beta), \alpha'}$.
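
For a known parse tree, the grounded SPN never has to be built explicitly: its root value can be computed by a recursive pass over the tree that sums over the subclass of each node. The sketch below assumes a single binary `buggy` attribute per node and invented parameters; the `Node` and `Model` containers and all function names are hypothetical, chosen only to mirror the construction steps listed above.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Node:
    symbol: str
    rule: str
    children: list = field(default_factory=list)  # non-terminal children only
    buggy: Optional[int] = None                   # 0/1 if observed, None to sum out


@dataclass
class Model:
    pi_start: list   # pi_start[k]: probability of subclass k at the root
    p_buggy: dict    # (symbol, k) -> Bernoulli parameter of the buggy attribute
    rho: dict        # (symbol, k, rule) -> rule probability
    pi_child: dict   # (symbol, k, rule, child symbol) -> list over child subclasses


def upward(model, node, k, memo):
    """Value of the grounded sub-SPN rooted at `node`, given subclass k."""
    key = (id(node), k)
    if key in memo:
        return memo[key]
    if node.buggy is None:
        value = 1.0  # attribute summed out
    else:
        p = model.p_buggy[(node.symbol, k)]
        value = p if node.buggy == 1 else 1.0 - p
    value *= model.rho[(node.symbol, k, node.rule)]
    for child in node.children:
        weights = model.pi_child[(node.symbol, k, node.rule, child.symbol)]
        value *= sum(w * upward(model, child, j, memo) for j, w in enumerate(weights))
    memo[key] = value
    return value


def likelihood(model, root):
    """Root sum node: mix over subclasses of the start symbol."""
    memo = {}
    return sum(w * upward(model, root, k, memo) for k, w in enumerate(model.pi_start))


def posterior_buggy(model, root, target):
    """P(target.buggy = 1 | parse tree and the other observed attributes)."""
    saved = target.buggy
    target.buggy = 1
    joint = likelihood(model, root)
    target.buggy = None
    evidence = likelihood(model, root)
    target.buggy = saved
    return joint / evidence


# Hypothetical two-subclass model for a while_stmt with one condition child.
model = Model(
    pi_start=[0.4, 0.6],
    p_buggy={("while_stmt", 0): 0.01, ("while_stmt", 1): 0.05,
             ("condition", 0): 0.02, ("condition", 1): 0.2},
    rho={("while_stmt", 0, "r1"): 1.0, ("while_stmt", 1, "r1"): 1.0,
         ("condition", 0, "r2"): 0.8, ("condition", 1, "r2"): 0.6},
    pi_child={("while_stmt", 0, "r1", "condition"): [0.7, 0.3],
              ("while_stmt", 1, "r1", "condition"): [0.5, 0.5]},
)
cond = Node("condition", "r2")
root = Node("while_stmt", "r1", [cond])
post = posterior_buggy(model, root, cond)
```

The `memo` cache ensures each (node, subclass) pair is evaluated once, so the cost grows linearly with the number of tree nodes (and quadratically with the number of subclasses), in line with the linear-time claim for inference.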

Learning. The learning problem in TFLMs is to estimate
the $\pi$ and $\psi$ distributions from a training corpus of programs,
with known attribute values but unknown latent subclasses.
(The $\rho$ distributions have no effect on the distribution of
interest, since we assume that every program can be
unambiguously mapped to a parse tree.)

As is commonly done with SPNs, we train the model via
hard EM. In the E-step, given the current parameters of $\pi$
and $\psi$, we compute the MAP state of the training programs
(i.e. the latent subclass assignment that maximizes the
log-probability). In the M-step, we re-estimate the parameters
of $\pi$ and $\psi$, choosing the values that maximize the
log-probability. These two steps are repeated until convergence,
or for a fixed number of iterations.

If the attributes are modeled jointly rather than as a
product of univariate distributions, retraining the joint model in
each iteration of EM is computationally expensive. A more
efficient alternative is to use a product of univariates during
EM, in order to learn a good subclass assignment. The joint
model is then only trained once, at the conclusion of EM.
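
The alternation between the two steps can be illustrated on a deliberately simplified problem. The sketch below runs hard EM on a flat mixture over (buggy, suspiciousness) pairs, ignoring the parse tree entirely; in a TFLM the E-step would instead compute the MAP subclass assignment jointly over each tree. The smoothing constants, the variance floor, the random initialization and the function names are all assumptions of this sketch.

```python
import math
import random


def log_gauss(x, mean, var):
    """Log-density of a univariate Gaussian."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)


def hard_em(data, n_sub=2, iters=100, seed=0):
    """data: list of (buggy, suspiciousness) pairs.
    Returns (params, assignment), with params[k] = (prior, p_buggy, mean, var)."""
    rng = random.Random(seed)
    assign = [rng.randrange(n_sub) for _ in data]  # random initial subclasses
    params = []
    for _ in range(iters):
        # M-step: re-estimate parameters from the current hard assignment.
        params = []
        for k in range(n_sub):
            members = [d for d, a in zip(data, assign) if a == k]
            n = len(members)
            prior = (n + 1.0) / (len(data) + n_sub)              # Laplace smoothing
            p_bug = (sum(b for b, _ in members) + 1.0) / (n + 2.0)
            if n:
                mean = sum(s for _, s in members) / n
                var = sum((s - mean) ** 2 for _, s in members) / n
            else:
                mean, var = 0.5, 1.0                              # arbitrary default
            params.append((prior, p_bug, mean, max(var, 1e-3)))  # variance floor
        # E-step: pick the most probable subclass for every instance.
        new_assign = []
        for b, s in data:
            scores = [
                math.log(prior) + math.log(p if b else 1.0 - p) + log_gauss(s, mean, var)
                for prior, p, mean, var in params
            ]
            new_assign.append(max(range(n_sub), key=scores.__getitem__))
        if new_assign == assign:
            break  # converged: the assignment no longer changes
        assign = new_assign
    return params, assign


# Hypothetical data: mostly clean low-score lines, a few buggy high-score lines.
data = [(0, 0.10), (0, 0.12), (0, 0.08), (0, 0.11),
        (1, 0.90), (1, 0.88), (0, 0.92), (1, 0.91)]
params, assign = hard_em(data)
```

Using univariate attribute distributions inside the loop, as here, keeps each M-step a closed-form count; a more expensive joint model can then be fitted once on the final assignment.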


Experiments 

We performed an experiment to determine whether TFLMs’
ability to combine a coverage-based fault localization
system with learned bug patterns improves fault localization
performance, relative to using the coverage-based system
directly. As a representative coverage-based method, our study
used Tarantula, one of the most widely-used approaches
in this class, and a common comparison system for fault
localization algorithms. We also compared to the
statement-based version of Liblit et al.’s Statistical Bug Isolation (SBI)
system (Liblit et al. 2005), as adapted by Yu et al. (2008).
SBI serves as a representative example of a lightweight
statistical method for fault localization.


Subjects 

We evaluated TFLMs on four mid-sized C programs
(table 1) from the Software-artifact Infrastructure Repository
(Khurshid et al. 2004). All four test subjects are real-world
programs, commonly used to evaluate fault localization
approaches. The repository contained several sequential
versions of each program, each with several buggy versions.
The repository also contained a suite of between 124 and
525 TSL tests for each version, which we used to compute
the Tarantula scores.

Table 1: Subject programs

Program   Versions   Executable LOC   Buggy vers.
grep      4          3368 ± 122       8 ± 5
gzip      5          1905 ± 124       7 ± 3
flex      5          3907 ± 254       10 ± 4
sed       7          2154 ± 389       3 ± 1

Table 4: TFLM learning and inference times

Program   Avg. learn time (s)   Avg. infer time (s)
grep      1135.00               20.91
gzip      433.33                5.25
flex      978.63                13.15
sed       326.37                5.18

Table 2: Localization accuracy (fraction of lines skipped)

Program   TFLM    Tarantula   SBI
grep      0.645   0.640       0.564
gzip      0.516   0.682       0.540
flex      0.770   0.704       0.618
sed       0.927   0.851       0.603

The number of executable lines was measured by the gcov
tool. We excluded buggy versions where the bug occurred in
a non-executable line (e.g. lines excluded by preprocessor
directives), or consisted of line insertions or deletions.
Unlike most previous fault localization studies that use these
subjects, we do not exclude versions for which the test
results were uniform (i.e. consisting entirely of passing or
failing tests). Although coverage-only methods such as
Tarantula can provide no useful information in the case of
uniform test suites, TFLMs can still make use of learned
contextual information to determine that some lines are more
likely than others to contain a fault.


Methodology 

We implemented TFLMs for a simplified version of the C
grammar with 23 non-terminal symbols, ranging from
compound statements like if and while to atomic single-line
statements such as assignments and break and continue
statements. Each symbol has a buggy attribute, and a
suspiciousness attribute, which is the Tarantula score
of the corresponding line. (For AST nodes that correspond
to multiple lines in the original code, we use the highest
Tarantula score among all lines.) As described in the
previous section, the attributes are modeled as independent
univariates during EM (buggy as a Bernoulli distribution,
and suspiciousness as a Gaussian), and then via a logistic
regression model within each subclass. The model
predicts the buggy attribute, using the Tarantula score and a
bias term as features. We use the scikit-learn (Pedregosa
et al. 2011) implementation of logistic regression, with the
class_weight='auto' parameter, to compensate for
the sparsity of the buggy lines relative to bug-free lines. For
TFLMs, we ran hard EM for 100 iterations. For each subject
program, TFLMs were learned via cross-validation, training
on all versions of the program except the one being
evaluated. The number of latent subclasses was also chosen via
cross-validation, from the range [1,4].

Table 3: TFLM (with Tarantula feature) vs Tarantula alone

Program   TFLM wins   Ties   Tarantula wins
grep      18          0      14
gzip      11          0      23
flex      31          0      18
sed       17          0      7
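
The per-subclass classifier can be sketched as follows. The data here is synthetic and invented purely for illustration, and a single model is fitted rather than one per latent subclass. Note also one deliberate substitution: the `'auto'` value of `class_weight` was removed from later scikit-learn releases, so this sketch uses `'balanced'`, the closest currently available option, which may not weight classes identically.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic stand-in for one subclass's training data (an assumption):
# many bug-free lines with lower Tarantula scores, few buggy lines with higher ones.
n_clean, n_buggy = 200, 10
scores = np.concatenate([rng.uniform(0.0, 0.7, n_clean),
                         rng.uniform(0.5, 1.0, n_buggy)])
labels = np.concatenate([np.zeros(n_clean, dtype=int),
                         np.ones(n_buggy, dtype=int)])
X = scores.reshape(-1, 1)  # single feature: the Tarantula score

# The bias term corresponds to the intercept, which is fitted by default.
# class_weight="balanced" reweights examples inversely to class frequency.
clf = LogisticRegression(class_weight="balanced")
clf.fit(X, labels)

# Predicted probability that buggy = 1 for each line.
p_buggy = clf.predict_proba(X)[:, 1]
```

Without class reweighting, a classifier trained on data this imbalanced tends to predict a near-zero bug probability everywhere; reweighting spreads the predicted probabilities so that they remain useful for ranking.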

The output of a fault localization system is a ranking of
the lines of code from most to least suspicious. For TFLMs,
we ranked the lines by predicted probability that buggy = 1.
(Each line in the original program is modeled by the
finest-grained AST node that encloses it.) The evaluation
metric was the ‘fraction skipped’ (FS) score, i.e. the fraction
of executable lines ranked below the highest-ranked buggy
line. Despite its limitations (Parnin and Orso 2011), this is a
widely-used metric for fault localization (Jones and Harrold
2005; Abreu et al. 2009).
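
As a concrete reading of this definition, the FS score can be computed from a ranking as sketched below. The function name and the input format (a mapping from executable line numbers to suspiciousness) are assumptions of this sketch, and ties are broken by insertion order, which is only one of several possible conventions.

```python
def fraction_skipped(scores, buggy_lines):
    """scores: dict mapping each executable line to its suspiciousness
    (higher means more suspicious). buggy_lines: set of faulty lines."""
    # Rank lines from most to least suspicious (stable sort for ties).
    ranking = sorted(scores, key=scores.get, reverse=True)
    # Position of the highest-ranked buggy line.
    first_bug = min(ranking.index(line) for line in buggy_lines)
    # Lines ranked strictly below that position are the ones "skipped".
    below = len(ranking) - first_bug - 1
    return below / len(ranking)


# Hypothetical four-line program in which line 12 is the fault.
example_scores = {10: 0.9, 11: 0.2, 12: 0.7, 13: 0.1}
fs = fraction_skipped(example_scores, {12})
```

Here line 12 is ranked second of four, so two lines fall below it and the score is 0.5; a perfect ranking of an n-line program would score (n - 1) / n.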


Results 

Results of our experiments are displayed in tables 2 and 3
and figure 2. TFLMs outperform Tarantula and SBI on
three of the four subjects, isolating the majority of bugs more
effectively, and earning a higher average FS score. However,
TFLMs perform poorly on the gzip domain. This
demonstrates the main threat to the validity of our method: machine
learning algorithms operate under the assumption that the
test data is drawn from a similar distribution to the training
data. If the bugs occur in different contexts in the training
and test datasets (as in gzip), learning-based methods may
perform worse than methods that try to localize each
program independently. This risk is particularly great when
learning from a small corpus of buggy programs.

However, in three of the four subjects in our experiment,
the training and test distributions are sufficiently similar to
allow useful generalization, resulting in improved fault
localization performance. TFLMs’ advantage arises from their
ability to localize faults even when the coverage matrix used
by Tarantula does not provide useful information (e.g.
when the tests are not sufficiently discriminative). TFLMs
combine the coverage-based information used by Tarantula
with learned bug probabilities for different symbols,
in different contexts. Context sensitivity is captured via
latent subclass assignments for each symbol.

As seen in table 4, our unoptimized Python implementation
predicts bug probabilities in a few seconds for programs
a few thousand lines in length. An optimized implementation
may be able to make predictions at interactive speeds;
this makes TFLMs a practical choice of inference engine for
a debugging tool in a software development environment.
Learning TFLMs can take several minutes, but note that the
model can be trained offline, either from previous versions
of the software being developed (as in our experiments), or
from other related software projects expected to have a
similar bug distribution (e.g. projects of a similar scale, written
in the same language).

Figure 2: The horizontal axis is the fraction of lines skipped (FS), and the vertical axis is the fraction of runs for which the FS
score equalled or exceeded the x-axis value. Panels: (a) grep; (b) gzip.


Conclusions 

This paper presented TFLMs, a probabilistic model for fault
localization that can be learned from a corpus of buggy
programs. This allows the model to generalize from previously
seen bugs to more accurately localize faults in a new
context. TFLMs can also incorporate the output of other
fault-localization systems as features in the probabilistic model,
with a learned weight that depends on the context. TFLMs
take advantage of recent advances in tractable probabilistic
models to ensure that the fault location probabilities can be
inferred efficiently even as the size of the program grows.
In our experiments, a TFLM trained with Tarantula as a
feature localized bugs more effectively than Tarantula or
SBI alone, on three of the four subject programs.

In this work, we used TFLMs to generalize across
sequential versions of a single program. Given adequate training
data, TFLMs could also be used to generalize across more
distantly-related programs. The success of this approach
relies on the assumption that there is some regularity in
software faults, i.e. the same kinds of errors occur repeatedly
in unrelated software projects, with sufficient regularity that
a machine learning algorithm can generalize over these
programs. Testing this assumption is a direction for future work.

Another direction for future work is extending TFLMs
with additional sources of information, such as including
multiple fault localization systems, and richer program
features derived from static or dynamic analysis (e.g.
invariants (Hangal and Lam 2002; Brun and Ernst 2004)).
TFLM-like models may also be applicable to debugging methods
that use path profiling (Chilimbi et al. 2009), giving the
user more contextual information about the bug, rather than
just a ranked list of statements. The recent developments in
tractable probabilistic models may also enable advances in
other software engineering problems, such as fault
correction, code completion, and program synthesis.